In conclusion, therefore, the reference bias shown in this study
seems to be real. Such a finding has important implications, since
there is no reason to believe that rheumatologists are more biased
than others in selecting references. A reader tracing the literature on
any new drug using the reference lists given in the articles might risk
obtaining a biased sample. Reference bias has another serious
implication: it may render the conclusion of the individual article
less reliable. Is this also true for review articles, and for other
disciplines in medicine?


The study was supported by a grant from the Danish Medical Research
Council. I thank the University Library, Copenhagen, the medical
companies, and Alice Nerhede, librarian at Herlev Hospital, for help in data
collection; Dr John Anderson for linguistic help; and, especially, Dr
Thorkild I A Sørensen, liver unit, Hvidovre Hospital, for his valuable
suggestions and comments on the manuscript.


Towards a reduction in publication bias

ROBERT G NEWCOMBE 


Abstract 

Current practice results in the publication of many research
studies in medical and related disciplines which may be criticised
on the grounds of inadequate sample size and statistical power.
Small studies continue to be carried out with little more than a
blind hope of showing the desired effect. Nevertheless, papers
based on such work are submitted for publication, especially if
the results turn out to be statistically significant. There is
confusion about what makes a result suitable for publication.
Often there is a preference for statistically significant results at
the peer review stage. Consequently published reports of small
studies tend to contain too many false positive results and to
exaggerate the true effects.

The use of a criterion of a posteriori power does not eliminate
the bias; a priori power is the criterion of choice. This could be
implemented by peer review of study protocols at the planning
stage by funding bodies and journals.


Introduction

Profound biological and behavioural differences between human
beings mean that statistical methods have to be used in presenting
medical research findings in an unbiased way. Hence statisticians
have devised methods of estimation and significance testing, which
are now widely used. Nevertheless, though the mathematical
aspects of these methods are acceptable, what is done with the
results commonly leads to serious selection bias. An article that
reports a statistically significant difference between two treatments
is more likely to be published than one which does not. Many
research studies have inadequate numbers of subjects, and
significance can be attained only if chance conveniently exaggerates
the difference.

So long as statistical significance is used as a major criterion of 
acceptability for publication the published results of medical 
research will contain a high proportion of false positive results. 
Thus quantitative estimates of treatment effects taken from
published work cannot be regarded as free from bias. There are
established methods to calculate the power of a study, which is the
probability of detecting a specified, important difference using a test
with a set significance level. The interpretation of statistical power is
satisfactory only when it is calculated with values specified at the
design stage of the study. The proper method to assess the adequacy
of the sample size is by peer review of the values specified in the
protocol. If this is done the significance level eventually attained is
no longer relevant to selection for publication.


Importance of sample size 

Manuscripts submitted to medical journals often contain serious
statistical faults. Various steps have been taken to remedy this,
notably the checklists used by the BMJ, and there is now also an
increased awareness of the need for therapeutic efficacy to be
evaluated with randomised controlled trials. Nevertheless, power
calculations are still rarely used.

Conventional significance testing (table I) leads to great emphasis
on the type I error rate α, but the type II error rate β and its
complement, the power 1 − β, though very important, are neglected.
In particular, in a clinical trial the number of subjects required
depends on the α and β levels chosen, the treatment difference of
interest, and the degree to which the treatment effect varies between
subjects. The choice of the first three of these is somewhat arbitrary,
and the fourth may be difficult to estimate. Nevertheless, the study
is likely to be valid only if values are chosen for these parameters and
the resulting sample size requirement determined, whether by the
use of formulas, diagrams, or tables.
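
The dependence of the required number of subjects on these four
quantities can be sketched with the usual normal approximation for
comparing two group means. This is an illustrative sketch only, not a
method taken from this article: the function name
`sample_size_per_group`, the two sided test, the equal group sizes, and
the example values are all assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(alpha, beta, delta, sigma):
    """Approximate subjects needed per group to compare two means.

    alpha: two sided type I error rate
    beta:  type II error rate (power is 1 - beta)
    delta: treatment difference of interest
    sigma: between-subject standard deviation of the response
    Uses the normal approximation n = 2 * ((z_a + z_b) * sigma / delta)^2.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical example: detect a difference of half a standard deviation
# with alpha = 0.05 and power 0.8.
n = sample_size_per_group(alpha=0.05, beta=0.2, delta=0.5, sigma=1.0)
print(n)
```

Under these assumed values the approximation gives about 63 subjects
per group, and halving the difference of interest roughly quadruples
the requirement, which is why a small study can rarely detect a modest
effect.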

The most obvious consequence of an inadequate sample size is 
that investigators may well not show a clinically important effect.
Such a false negative result, if propagated by publication, is apt to be
widely misinterpreted as a demonstration that there is no difference
between the treatments. This has provoked two responses among
those who decide what is to be published. Firstly, statisticians
advocate a shift of emphasis away from significance testing and
towards estimation and confidence intervals. A wide confidence
interval is understood as implying that large, potentially important
differences cannot be ruled out. The confidence interval approach
may also help in a wider context, for instance in showing that the
results of two apparently disparate studies are not incompatible, the
truth perhaps lying somewhere between their two estimates.

The second response is to exclude small studies, with high β,
from publication. There are three approaches in which this may be
done. Firstly, attainment of a desired level of significance may be
used as a criterion. This seems plausible because, for a fixed α, both
the attained significance level p and the type II error rate β reflect the
sample size. Nevertheless, p (unlike α) depends on sampling
variation and the use of this criterion leads to publication bias.
Secondly, assessment based on statistical power calculated from the
data gives the appearance of greater soundness; it does not fall into
the obvious trap of the first approach and is based on data rather
than on the uncertainty of a prior targeted difference. In reality,
however, the requirement of significance using an α level of 0.05 and
an a posteriori β of 0.2 amounts to nothing more than statistical
significance at a more stringent level of α ≈ 0.005 and thus also does
not avoid publication bias. This kind of β value is akin to p, not to α,
and includes sampling variation. The third approach is to impose
the requirement of an adequately low β value assessed a priori; this
does not lead to bias since the β value is not subject to sampling
variation.
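
The equivalence claimed for the second approach can be checked
numerically. The sketch below assumes a two sided z test and the
common convention that a posteriori power is computed by treating the
observed difference as if it were the true one; the function name
`equivalent_alpha` is hypothetical.

```python
from statistics import NormalDist


def equivalent_alpha(alpha, beta):
    """Significance level implied by demanding both p < alpha and an
    a posteriori type II error rate no greater than beta.

    Under the stated convention, the observed z statistic must exceed
    z_(1 - alpha/2) + z_(1 - beta) for the a posteriori power to reach
    1 - beta, so the joint requirement is just a stricter cut-off on p.
    """
    z = NormalDist()  # standard normal distribution
    z_required = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    # Two sided p value at the required z statistic.
    return 2 * (1 - z.cdf(z_required))


print(round(equivalent_alpha(alpha=0.05, beta=0.2), 4))
```

With α = 0.05 and β = 0.2 the required z statistic is about 2.8, which
corresponds to a two sided significance level of roughly 0.005, in
line with the figure quoted in the text.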

Thus results based on studies which had a poor prospect of 
yielding useful information may justifiably be rejected, but only if 
the criterion is based on power assessed a priori.


Nature and consequences of publication bias 

Publication bias may be defined simply: significant results are
preferred for publication. Attention was drawn to it as early as 1963
and it has been “rediscovered” several times since. Suppose the α
rate chosen is 0.05. Then just 5% of studies in which the null
hypothesis is valid will yield a test statistic significant at the 5%
level. If attention is limited
to studies that attain publication, however, the proportion of such
false positive results is higher. The significance testing paradigm
does not permit us to say what proportion of statistically significant
results are false positives, but the effect of publication bias is to
make this proportion disquietingly larger than it would otherwise
be.

Correspondingly, studies selected for publication tend to contain 
exaggerated estimates of the main effects, and trials with truly
modest treatment effects will achieve statistical significance only
if random variation conveniently exaggerates these effects.
Conversely, variation is underestimated. These biases operate more
strongly the more inadequate the sample size. A study with low
power, where the true treatment effect is zero or small, must grossly
exaggerate it (by chance) to show significance and attain a prospect
of publication. False positives and exaggerated estimates may well 
dominate much of medical publication. This phenomenon is likely 
to contribute to the disparity commonly found in the results of 
different studies, which leads to controversy instead of well 
established, consistent findings. The desire to minimise the impact 
of false positive assertions may result in a preference for publishing 
findings which refute a previous claim, rather than confirmatory 
results—a further source of bias. 
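
The exaggeration described above can be illustrated with a small
simulation. This is a sketch under stated assumptions rather than an
analysis from the article: every trial is taken to compare two equal
groups with a known standard deviation, only trials significant at the
two sided 5% level are "published", and the function name, parameter
values, and random seed are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_published_estimates(true_effect, sigma, n_per_group,
                                 n_trials=20000, alpha=0.05, seed=1):
    """Simulate many small trials and keep only the significant ones."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two group means.
    se = sigma * sqrt(2 / n_per_group)
    # Each trial's estimated difference varies around the true effect.
    all_estimates = [rng.gauss(true_effect, se) for _ in range(n_trials)]
    # Selection for publication: statistically significant results only.
    published = [d for d in all_estimates if abs(d) / se > z_crit]
    return {
        "all_mean": mean(all_estimates),
        "published_mean": mean(published),
        "published_fraction": len(published) / n_trials,
    }


result = simulate_published_estimates(true_effect=0.2, sigma=1.0,
                                      n_per_group=20)
print(result)
```

In this assumed setting the average estimate over all trials stays
close to the true effect, whereas the average over the "published"
trials is several times larger, because an underpowered trial can
reach significance only when chance inflates the observed difference.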

Such selection bias may equally be introduced by the editorial 
team (editorial selection bias) or by the researcher or supervisor or 
head of department (submission selection bias). At each stage a
significant result may be construed as particularly encouraging and
failure to attain significance as correspondingly discouraging. This
operates in addition to any biases introduced because of prejudice.

Publication bias continues to arise only because two conditions 
hold: the criteria for selecting studies for publication are inadequate,
and many studies performed and submitted for publication have
been done on small numbers of subjects. Significance testing, the
time honoured framework for inductive inference, is evidently
deficient as a selection criterion. Nevertheless, the confidence
interval approach incurs the same danger of publication bias:
studies in which the confidence interval for the size of the effect
excludes zero are likely to be preferred for publication, a condition
that is equivalent to statistical significance. It has been asserted that
overconcentration on simplistic significance testing is responsible
for most of the ill based criticisms of small trials. The more careful
approach using confidence intervals overcomes many of the 
difficulties. But so long as confusion remains as to what constitutes a 
result warranting publication a bias will ensue from submission and 
editorial selection processes. 

The other prerequisite for publication bias is the widespread use 
of inadequate sample sizes. The other consequence of this is that a
doctor seeking information to guide a clinical decision is confronted
with a bewildering variety of conflicting claims. To remedy this
dilemma “meta-analyses” or “overviews” have been constructed,
which fit together results of several studies and seek to make the best
use of data from studies which would otherwise yield little
information. Nevertheless, published studies are still a biased sample
of all the relevant work that has been done. The only prospect of
eliminating this bias is to contact all investigators who may have done
relevant work and ask for their unpublished data. Iain Chalmers and
Thomas Chalmers are pursuing this goal in connection with the
Oxford Database of Perinatal Trials, and their work should provide
some evidence on the quantity of “negative” studies that either
never get written up or never get published. 

The high prevalence of small studies stems from the way that 
research is organised. Much material submitted for publication has
come from studies that are regarded as the work of an individual
researcher, performed within severe constraints of time and
resources; often there is little more than a blind hope that the
desired effect will be shown. Research output remains a major
criterion for assessing candidates for promotion and so on, even
though it is widely recognised to be deficient. When research output
is equated with publication, however, the consequences for
the standards of published work are grave. The constraints an
individual investigator faces often preclude obtaining results of
external validity, but publication in a highly regarded, widely
circulated journal implies such validity, however mistaken this is
given the background of inadequate statistical power. 

Thus the researcher faces a dilemma: on the one hand, most 
studies he can perform will need the collaboration of others to attain 
adequate statistical power; on the other hand, any collaborative 
study (even if it is feasible) will deprive him of personal kudos. Only 
those who are remote from the researcher’s dilemma (journal
editors and referees, funding bodies, and, to a lesser degree, ethical
committees) can uphold the highest scientific standards with no
conflict of loyalties. These agents are not obliged to accept the status
quo and can refuse to support or publish inadequate research. I
regard it as their prerogative, if not obligation, to do so.

A radical proposal

Selection of work for funding or publication, then, should
primarily be based on reasonableness a priori: Has the design
adopted (explicitly or implicitly) a good prospect of yielding useful
information? “Design” here includes the study idea, scientific basis,
clinical relevance, originality, and so on, as well as the study’s
structure and the number of subjects. If all this is satisfied then the
paper should be published irrespective of whether statistical
significance or the targeted size of difference was attained. The
difference actually observed is irrelevant to the decision (see
Mahoney, p 163). The assessment of scientific validity would
therefore be the same, whether carried out before the study or after
it. The only additional requirement a posteriori is adequate
adherence to the protocol, in particular attainment of the planned
sample size.

The consequences of this shift in emphasis to a priori criteria are
most important in the case of studies of inadequate power. Table II
contrasts what would happen to the results of these studies under
the proposed rule with what is likely to happen at present. The
publication of “positive” findings would be inhibited. The
advantage would be the exclusion of false positives from inadequate
studies, with their grossly exaggerated estimates of differences.
Against this must be weighed the cost of failing to publish true
positives, which would occur quite often (1 − β = 0.5), but which
are based on inadequate evidence and also overestimate the
difference.

Application of this principle to studies with adequate power 
would lead to more widespread publication of negative results 
(table III). True negative results would be salvaged from studies of
acceptable power, though these might currently be accepted
anyway, especially if supplemented with confidence intervals. This
would be at the cost of publishing studies with false negative results,
though these would not be too frequent (β = 0.1).

Both journal editors and funding bodies can and should require 
specification of statistical power. They should require that a
protocol or a write up should describe clearly the details of the
design of the study, in particular the following:

(a) the structure;

(b) the choice of the most appropriate criterion variable on which
to base the power calculation and the most appropriate groups to be
compared;

(c) the size of the effect to be reliably detected and (except in the
case of a binary variable) how much this effect varies between
subjects;

(d) the sample size (specifying accrual rate and period) aimed at,
with specific allowance for expected dropouts;

(e) consequent statistical power and the method by which it was
derived.

These parameters should be identical in the protocol and in the
eventual study report. The same criterion should be used to assess
validity at both stages; in particular, the write up should be
assessed on the basis of the values laid down before any data were
collected. The only additional requirements at the publication stage
would be the completion of the study as laid down in the protocol,
with full information on as many subjects as were contracted for;
variability in response between subjects not grossly in excess of
that planned for; and the usual standards of adequate analysis,
inference, and discussion.


TABLE III: Consequences of a shift to assessment by a priori

This approach entails assessment of the parameters assumed on
an a priori basis; they are to be judged in the light of knowledge
current at the time the study was designed. Other results coming to
light during the study should not be allowed to affect the judgment
of validity (though occasionally a major advance occurring during
this period may render the results no longer relevant).

Journal editors as well as grant awarding bodies could implement
this proposal most effectively by requiring submission of protocols
for peer review at the planning stage. In either case an independent
review body could be used. Specialists in the subject could assess the
reasonableness of the values supplied for the parameters on which
the power calculation is based (particularly the smallest clinically
important difference), and the verification of the power calculation
would not be a formidable task for a statistician or other assessor
familiar with this. These assessments, once performed for the
protocol, would not need to be repeated for the write up.
Consequently, having accepted a protocol as adequate and relevant,
a journal could offer eventual publication, conditional only on
completion of the study in adequate conformity to the protocol
together with the usual requirements of adequate analysis,
inference, and discussion. It would become normal practice to accept
an article only if this had been done.

The work of Mahoney suggests that reviewers may find it difficult 
to comment on incomplete manuscripts. Nevertheless, Mahoney’s
study is not an ideal model for the process I advocate, for two
reasons. Firstly, his reason for the incompleteness of the manuscript
was inadequate. It would be understood, however, that the material
to be evaluated was only a protocol, even though it would be
virtually unaltered in the eventual article, and this would become
an accepted element of peer review (as it is, to a limited extent, with
funding bodies). Secondly, Mahoney studied psychologists known
to have entrenched, diametrically opposite beliefs, to a degree (I
hope) not encountered often among doctors; knowing that results
would shortly be disclosed, they would be reluctant to commit
themselves unequivocally to a favourable stance, lest the results
turned out to contradict their chosen position. At the stage of review
of a protocol this possibility is more remote. 

To put these recommendations into practice would be more
feasible for formal, well structured study designs, such as the
clinical trial, than for less formal explanatory work, for which the
rationale of significance testing is more contentious. Like other
alterations in editorial policy, this would best be introduced as a
decisive change, as from a given date, with advance indication
given, as a piecemeal approach to change is unlikely to work. I
hope that enlightened editors will take up the challenge; the lead
must come from an established, prestigious journal that can afford
to be choosy.


Conclusion

Publication bias is endemic and will remain so as long as the
sample sizes commonly used in research are too small and the
methods used to assess adequacy of sample size are deficient.
Assessment by a priori criteria, in particular systematic peer
review at the planning stage, would result in a much tighter
measure of control over the quality of published work, with the
prospect of improvement in study design in general and statistical
power in particular.

I thank several colleagues, especially Dr Edward C Coles and the BMJ
editorial team and the referee, for constructive comments.


Medicine and the Media


At the annual scientific meeting of the British Paediatric
Association last year the prize for the best paper presented by a
young paediatrician went to a member of a research group from
Oxford. Papers offered for the annual meeting are examined by the
association’s academic board not only for their scientific worth but
also for adherence to ethical standards. This paper, later published
in the Lancet, has now been condemned by certain sections of the
press and by a group of members of parliament. What was the work 
so condemned? 

Preterm infants of low birth weight live at considerable risk, 
particularly of cardiorespiratory failure, and the risk is increased if 
they have to undergo an operation. Clinical experience suggested
that deep anaesthesia and narcotic analgesics would increase the
risk. That and the belief that such infants have a poor perception of
pain because of lack of myelinisation in the central nervous system
led to the conventional practice of anaesthesia with nitrous oxide
and muscle relaxants combined with artificial ventilation. In a study
of 40 published reports the Oxford team found that three quarters of
newborn babies undergoing surgical ligation of patent ductus
arteriosus had received muscle relaxants alone or with nitrous oxide.

In the preterm infant with a poor or absent ability to cry it 
is difficult to tell clinically whether pain and stress are being
experienced, but newer biochemical methods that detect hormones
and intermediary metabolites associated with stress now make the
assessment of stress more possible and prompted a re-examination
of the problem by the Oxford team. The team wanted to find out
whether adding a little narcotic analgesic to the accepted anaesthetic
regimen might prove beneficial rather than harmful. Using these
metabolic methods, they therefore compared the response to
surgical ligation of patent ductus arteriosus carried out under the
conventional regimen with and without the narcotic analgesic
fentanyl. The possibility that fentanyl might adversely affect
respiration and circulation postoperatively was also studied.

A randomised trial was designed with help from the National
Perinatal Epidemiology Unit in Oxford to ensure that the results
were statistically valid and that a meaningful result would be
recognised as soon as possible. After only eight babies in each group
had been operated on the results showed that the new regimen was
significantly superior to the old not only in reducing the stress
response estimated biochemically but also in improving the
postoperative state. Thus for the first time good scientific evidence
was produced of the need to provide deeper anaesthesia during 
operations on these tiny infants. 


This research was commended by the distinguished American 
paediatrician Dr William Silverman, author of the widely acclaimed 
book Human Experimentation: A Guided Step Into the Unknown. He
wrote that the Oxford workers “deserve a loud vote of thanks for the
ethically sound effort to subject to a rigorous test opinion based on
long standing practice. And their call for further study should
not fall on deaf ears. It is indeed urgent to determine the
pathophysiological consequences of unrelieved pain and suffering
inflicted during everyday care of newborn babies.” 

Members of the British Paediatric Association were thus amazed
and the doctors who had done the work bewildered and distressed
when after a distorted report in the Daily Mail entitled, “Pain-killer
shock in babies' operations” (8 July) this work became the subject
of a condemnatory “press release: for immediate publication”
issued by some members of parliament forming the All Party
Parliamentary Pro-Life Group. The Lancet article appeared in
January, the story in the Daily Mail in July, and the press release
from the members of parliament in August. The press release
was entitled “Inhumane baby operations slammed” and the first
paragraph stated:

“Fourteen members of parliament have demanded an inquiry 
into trials in which sixteen premature babies were given open heart
surgery, eight of them without the use of pain killers to test whether 
or not the babies could experience pain.” 

The press release then said that the General Medical Council was 
being asked to investigate these trials with a view to bringing those 
responsible before its disciplinary committee. It continued:

“In a statement Sir Bernard Braine said:

‘The trials seemed to us to be even more barbarous when one
considers that the babies being tested for pain were given curare, a
paralysing drug, so that they would have been unable to kick or
struggle even if they were in agony, the obvious intention being to
keep them immobile at all costs throughout the operation. Apart
from this they were given only nitrous oxide (laughing gas).’”

Implying misleadingly that wisdom acquired from the research 
existed before it was carried out, the statement went on:

“Not surprisingly post-operatively they fared far worse than the 
eight babies who were given pain killers. Two of the disadvantaged 
babies suffered from hypotension, two showed poor peripheral
circulation—both of which can be indications of shock which most